Introduction

In this part, we need to distinguish supervised and unsupervised machine learning methods. Supervised machine learning methods use the information of the label or the category of a pattern of data. Unsupervised machine learning methods aim to find a pattern of groups inside the dataset.

Classification and Clustering. Classification is a supervised machine learning method while clustering is an unsupervised machine learning method.
Classification and Clustering. Classification is a supervised machine learning method while clustering is an unsupervised machine learning method.

1) Clustering

Algorithms for Clustering

  1. K-means
  2. K-medoids
  3. Density Based
  4. Hierarchical
  5. Common Applications:
  6. Recommender System
  7. Customer Segmentation

Heatmap and Hierarchical Clustering: Application in Cancer

  • Clustering can be used to separate the genetic profiles of cancer cells that drive progression or defeat of the disease.

  • ADRN neuroblastoma cells are differentiated cells similar to neurons that are killed by chemiotherapy.

  • MES neuroblastoma cells are non-differentiated cells thought to behave like stem cells.

  • MES cells are resitent to chemotherapy.

Heatmap using RNA-Seq and gene expression data

Clustering of cell differentiation markers in neuroblastoma. Figure from Groningen (2017).
Clustering of cell differentiation markers in neuroblastoma. Figure from Groningen (2017).

Heatmap using epigenetic profiles

Super-enhancers are regions of chromatin that are maintained open to facilitate expression of proteins that keep the differentiation state of the neuroblastoma cells.

Cell differentiation state is maintained by organization of super-enhancers in neuroblastoma.Figure from Groningen (2017).
Cell differentiation state is maintained by organization of super-enhancers in neuroblastoma.Figure from Groningen (2017).

2) SARS-CoV-2 dataset and heatmap

Cargar datos de frecuencia de polimorfismo

  • La Tabla de Incidencia o de Frecuencia se construye usando los archivos VCF de la pipeline de Bash
  • Visualicemos inicialmente la tabla construida a partir del genotipado SARS-CoV-2 presente en los archivos VCF
  • El genotipado se realizó despues de la alineacion y la “llamada de variantes” o variant call que es hecha empezando con archivos FASTA o FASTQ (final = VCF)

Cargar datos de frecuencia de polimorfismos

  • Como puedo contar en cuántos archivos VCF hay un SNP particular de interés, o muchos SNPs, puedo también medir la frecuencia de estes SNPs entre todos los archivos VCF que analicé en la variant call
  • La Tabla para el mapa de calor contiene los valores de frecuencia de los SNP identificados en SARS-CoV-2 en diferentes lugares del mundo y se puede construir utilizando la extracción del genotipo SNP que se muestra en el paso 3.

Load and visualize the dataset

heatmap_table <- read.table("./data/temp_file.txt", row.names = 1, header = TRUE, sep = "\t")
  • Load library to visualize table
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.1.2
  • Visualize data-frame
kable(heatmap_table, caption="Frequency Table of SARS-CoV-2 SNPs") %>%
  kable_styling("striped", full_width = F, font_size = 12) %>%
  scroll_box(width = "100%", height = "400px")
Frequency Table of SARS-CoV-2 SNPs
Bat Pangolin Morocco United.States Germany China Australia Brazil Italy Kenya France Caribbean Singapore Vietnam Switzerland Ghana Taiwan Argentina Saudi.Arabia Canada Finland
241 C > T 0 0 100 88 11 0 11 100 100 20 100 50 20 78 100 33 30 100 100 100 100
3037 C > T 67 0 83 88 11 0 11 100 100 60 100 50 20 78 90 33 30 100 100 100 100
11083 G > T 0 32 0 13 0 0 56 0 0 30 22 50 70 11 0 0 40 0 0 10 0
14408 C > T 0 0 100 88 11 0 11 100 100 60 100 50 20 78 90 33 20 90 90 100 100
17747 C > T 0 0 0 0 0 0 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0
17858 A > G 0 0 0 0 0 0 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0
18060 C > T 67 58 0 0 0 0 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0
23403 A > G 0 0 100 88 11 11 11 100 100 60 100 50 20 78 100 33 10 90 80 100 90
26144 G > T 0 0 0 0 0 0 33 0 0 10 0 50 0 11 0 0 50 0 0 0 0
27046 C > T 0 0 0 0 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
28144 T > C 33 58 0 0 0 56 33 0 0 0 0 0 0 0 0 56 30 0 0 0 0
28881 G > A 0 0 0 13 11 0 0 29 56 10 0 0 0 44 50 22 20 40 0 0 0
28882 G > A 0 0 0 13 11 0 0 29 56 10 0 0 0 44 50 22 20 40 0 0 0
28883 G > C 0 0 0 13 11 0 0 29 56 10 0 0 0 44 50 22 20 40 0 0 0

3) Plot heatmap without normalization

Mutation positions

Before plotting the heatmap of mutation frequency, let’s look at the SNP nucleotide mutations at their positions:

Spike protein mutations of positions 23063bp (A) and 23404bp (B).
Spike protein mutations of positions 23063bp (A) and 23404bp (B).

Plot heatmap

  • Como verá a continuación, la normalización nos permite visualizar la agrupación de secuencias de murciélagos, pangolines y virus aislados en el la Lejano Oriente.

  • La normalización nos permite hacer una comparación con el analysis filogenético

Load libraries that compute the distance matrix

library("pheatmap")
library("RColorBrewer")
  • In the algorithm followed by the R package, similarity is computed based on the calculation of a distance matrix.
    • To better understand the algorith, please watch the STATQuest video. Link is present in the course website.
  • Make the dataframe a matrix
heatmap_table = as.matrix(heatmap_table)

Plot heatmap using pheatmap function

  • Note that the clusters for the rows are set to FALSE
pheatmap(heatmap_table, cluster_rows = F)

  • We can choose the color for the heatmap
col.pal <- brewer.pal(9,"Blues")
  • And print the heatmap with the color we chose
pheatmap(heatmap_table, cluster_rows = F, col.pal)

Distance matrix calculation

  • The distance matrix is calculated using the Euclidian distance between two points.

  • The R package pheatmap allows us to define the type of distance between two points.

    • Equation for calculation of the Euclidian distance between two points is can be found in Gatto (2023).
    • Watch the video to better understand the computation of the distance matrix
drows1 <- "euclidean"
dcols1 <- "euclidean"

Plot heatmap with several options of parameters

  • Change font size, cells height and width

  • Turn on clustering for the columns and the rows

    • Columns are regions or countries
    • Rows are the different types of mutations
    • There are potantially 30000 possible mutations
hm.parameters <- list(heatmap_table, 
                      color = col.pal,
                      cellwidth = 14, cellheight = 12, scale = "none",
                      treeheight_row = 100,
                      kmeans_k = NA,
                      show_rownames = T, show_colnames = T,
                      #main = "Full heatmap (avg, eucl, unsc)",
                      main = "Frequencies of SNP Variants of SARS-CoV-2",
                      clustering_method = "average",
                      ####
                      ####
                      #### SNP mutations are in the rows
                      cluster_rows = T, 
                      ####
                      ####
                      #### Different countries are in the columns
                      cluster_cols = T,
                      clustering_distance_rows = drows1, 
                      fontsize_row = 10,
                      fontsize_col = 10,
                      clustering_distance_cols = dcols1)
do.call("pheatmap", hm.parameters)

  • Note that SNPs that are often together are present in the same cluster.

  • SNPs that are very close (28881, 28882, and 28883) form a single cluster.

4) Clusters of SNPs Frequencies per region

The clustering part in r was based on the article of Kodali (2016).

clusters <- hclust(dist(heatmap_table))
plot(clusters)

Plotting the heatmap with Log Normalization

  • La visualización del mapa de calor se puede normalizar utilizando la escala logarítmica
  • Tomar el logaritmo de la representación de la expresión génica es un procedimiento estándar, ya que ayuda a homogeneizar la variación de frecuencia y reducir la dimensionalidad en la variación en la visualización del mapa de calor
  • Debido a mi experiencia en la visualización de la expresión génica utilizando mapas de color o heatmaps, decidí implementar también la normalización de la frecuencia alélica de los genotipos de SARS-CoV-2 identificados en este proyecto
library("pheatmap")
library("RColorBrewer")

heatmap_table <- read.table("./data/temp_file.txt", row.names = 1, header = TRUE, sep = "\t")
heatmap_table = as.matrix(heatmap_table)

# Define a normalization transformation
log_table_09_18_2020 = log (heatmap_table + 1)

# Plot heatmap using pheatmap function
pheatmap(log_table_09_18_2020)

# Escolher a cor do heatmap
col.pal <- brewer.pal(9,"Blues")

# Definir o tipo de correlacao entre as amostras (colunas) e os genes (linhas)
drows1 <- "euclidean"
dcols1 <- "euclidean"


#Plotar o heatmap, com as diversas opcoes determinadas
hm.parameters <- list(log_table_09_18_2020, 
                      color = col.pal,
                      cellwidth = 14, cellheight = 12, scale = "none",
                      treeheight_row = 200,
                      kmeans_k = NA,
                      show_rownames = T, show_colnames = T,
                      #main = "Full heatmap (avg, eucl, unsc)",
                      main = "Frequencies of SNP Variants of SARS-CoV-2",
                      clustering_method = "average",
                      cluster_rows = F, cluster_cols = T,
                      clustering_distance_rows = drows1, 
                      fontsize_row = 10,
                      fontsize_col = 10,
                      clustering_distance_cols = dcols1)
do.call("pheatmap", hm.parameters)

Clustering of SNPs by their frequencies (log-normalized)

  • Three important co-mutations (241C > T, 14408C > T, and 23403A > G) are in critical proteins
    • RNA replication (241C > T, 14408C > T) and
    • the S protein (23403A > G) for binding to ACE2 receptor.
    • All co-mutations occupy the same cluster, indicating the similarity of their frequencies.
clusters <- hclust(dist(log_table_09_18_2020))
plot(clusters)

library(stats)
plot(as.dendrogram(clusters))

Cluster colors

Highlight cluster as described r-charts (n.d.).

plot(as.dendrogram(hclust(dist(log_table_09_18_2020))))
rect.hclust(hclust(dist(log_table_09_18_2020)), k = 2,
            border = 3:10)

Visualizing Variants of Concern/Interest (VOC)

  • Some mutations or variants provide the virus with increased evolutionary success because of increased transmissibility resulting the interaction between the humna ACE2 protein and the viral Spike protein.

  • Polar interactions between the SARS-CoV-2 Spike RBD protein (white) and the human ACE2 protein (blue) calculated by Pymol using the mutagenesis tool.

A) Interaction of wild-type ASN501 (N501) residue in the Spike SARS-CoV protein binding domain (RBD) with tyrosine 41 in ACE2; B) When Asparagine is replaced by Tyrosine in RBD, the number of possible hydrogen interactions increases between Spike RBD and ACE2, possibly explaining the higher affinity between the virus Spike protein and ACE2.
A) Interaction of wild-type ASN501 (N501) residue in the Spike SARS-CoV protein binding domain (RBD) with tyrosine 41 in ACE2; B) When Asparagine is replaced by Tyrosine in RBD, the number of possible hydrogen interactions increases between Spike RBD and ACE2, possibly explaining the higher affinity between the virus Spike protein and ACE2.

References

Gatto, Laurent. 2023. Chapter 9 Unsupervised Learning: Clustering. https://uclouvain-cbio.github.io/WSBIM1322/sec-biostrings.html.
Groningen, Tim van. 2017. Neuroblastoma Is Composed of Two Super-Enhancer- Associated Differentiation States. Nature Genetics. Vol. 49.
Kodali, Teja. 2016. Hierarchical Clustering in r. https://www.r-bloggers.com/2016/01/hierarchical-clustering-in-r-2/.
r-charts. n.d. Hierarchical Cluster Dendrogram with Hclust. Accessed March 26, 2024. https://r-charts.com/part-whole/hclust/.